ceTe Software Help Library for Java August - 2020

    Text Extraction

    The text extraction feature of DynamicPDF Merger for Java pulls the text out of a PDF document. Text can be extracted from an entire PDF document (using the getText method of the PdfDocument class) or from a single page of a PDF (using the getText method of the PdfPage class). The getText method returns the text as a String.

    There are a few things to keep in mind when using the getText method to extract text from a PDF:

    • Text that is part of an image, a form field, or a note/comment will not be extracted.
    • Text is extracted in the order in which the PDF operators occur in the existing PDF, which may differ from the visual reading order.
    • In evaluation mode, text extraction is limited to 256 characters.

    The following code will extract the text in an existing PDF document.

    [Java]
    import com.cete.dynamicpdf.merger.PdfDocument;

    // Create the PDF document object
    PdfDocument pdfA = new PdfDocument("[PhysicalPath]/MyDocument.pdf");
    // Call the getText method on the PdfDocument object to get the text of the whole document
    String extractedText = pdfA.getText();

    The following code will extract the text from a specified page within a PDF.

    [Java]
    import com.cete.dynamicpdf.merger.PdfDocument;

    // Create the PDF document object
    PdfDocument pdfA = new PdfDocument("[PhysicalPath]/MyDocument.pdf");
    // Call the getText method on a PDF page to get the text from that page only
    String extractedText = pdfA.getPages().getPdfPage(1).getText();
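
    Because text that is part of an image is not extracted, the returned string may be empty or contain only whitespace for scanned pages. The following is a minimal sketch of a check a caller might run on the result. It uses only the Java standard library; the class name ExtractedTextCheck and the method hasUsableText are hypothetical helpers for illustration and are not part of DynamicPDF.

```java
public class ExtractedTextCheck {

    // Hypothetical helper (not a DynamicPDF API): returns true when the
    // extracted string contains at least one non-whitespace character.
    public static boolean hasUsableText(String extractedText) {
        return extractedText != null && !extractedText.trim().isEmpty();
    }

    public static void main(String[] args) {
        // In real use the argument would be the String returned by getText();
        // literal strings stand in for it here.
        System.out.println(hasUsableText("Invoice 1042")); // prints true
        System.out.println(hasUsableText("  \n"));         // prints false
    }
}
```

    A false result does not prove that the page is image-only. It only signals that getText returned nothing usable, so the caller may want to fall back to another approach or report the page.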